Introduction

One  of  the  major  obstacles  to  parallel  compilation  is  the 
question  of  how  to  automatically  identify  and  exploit 
the  underlying  parallelism  inherent  in  a  program.  We 
have  implemented  a  parallel  compiler  that  uses  novel 
techniques  to  detect  and  effectively  utilize  the  fine¬ 
grained  parallelism  that  is  inherent  in  many  numerically- 
intensive  scientific  computations.  Our  approach  differs 
from  the  current  fashion  in  parallel  compilation,  in  that 
rather  than  relying  on  the  structure  of  the  program  to  de¬ 
tect  locality  and  parallelism,  we  use  partial  evaluation[5] 
to  remove  most  loops  and  high-level  data  structure  ma¬ 
nipulations,  producing  a  low-level  program  that  exposes 
all  of  the  parallelism  inherent  in  the  underlying  numeri¬ 
cal  computation.  We  then  use  an  operation  aggregat¬ 
ing  technique  to  increase  the  grain-size  of  this  paral¬ 
lelism  to  match  the  communication  characteristics  of  the 
target  parallel  architecture.  This  approach,  which  was 
used  to  implement  the  compiler  for  the  Supercomputer 
Toolkit parallel computer [1], has proven highly effective
for  an  important  class  of  numerically-oriented  scientific 
problems. 

Our approach to compilation is specifically tailored
to produce efficient statically scheduled code for parallel
architectures which suffer from serious interprocessor
communication latency and bandwidth limitations.
For instance, on the eight-processor Supercomputer
Toolkit system in operation at M.I.T., six cycles¹
are required before a value computed by one processor is
available for use by another, while bandwidth limitations
allow only one value out of every eight values produced
to be transmitted among the processors. Despite these
limitations, code produced by our compiler for an important
astrophysics application² runs 6.2 times faster on
an eight-processor system than does near-optimal code
produced for a uniprocessor system.³

Interprocessor  communication  latency  and  bandwidth 
limitations  pose  severe  obstacles  to  the  effective  use  of 
multiple processors. High communication latency requires
that there be enough parallelism available to allow
each processor to continue to initiate operations even
while waiting for results produced elsewhere to arrive.⁴
Limited  communication  bandwidth  severely  restricts  the 


¹ A "cycle" corresponds to the time required to perform a
floating-point multiply or addition operation.

² Stormer integration of the 9-body gravitational attraction
problem.

³ The code produced for the uniprocessor was also partially
evaluated, to ensure that the factor of 6.2 speedup is entirely
due to parallel execution.

⁴ [5] (page 35) describes how the effect that interprocessor
communication  latency  has  on  available  parallelism  is  similar 
to  that  of  increasing  the  length  of  an  individual  processor’s 
pipeline.  In  order  to  continue  to  initiate  instructions  on  a 
heavily  pipelined  processor,  there  must  be  operations  avail¬ 
able  that  do  not  depend  on  results  that  have  not  yet  emerged 
from  the  processor  pipeline.  Similarly,  in  order  to  continue  to 
initiate  instructions  on  a  parallel  machine  that  suffers  from 
high  communication  latency,  there  must  be  operations  avail¬ 
able  that  do  not  depend  on  results  that  have  not  yet  been 
received. 


parallelism  grain-size  that  may  be  utilized  by  requir¬ 
ing  that  most  values  used  by  a  processor  be  produced 
on  that  processor,  rather  than  being  received  from  an¬ 
other  processor.  We  overcome  these  obstacles  by  com¬ 
bining  partial  evaluation,  which  exposes  large  amounts 
of  extremely  fine-grained  parallelism,  with  an  operation 
aggregating  technique  that  increases  the  grain-size  of 
the  operations  being  scheduled  for  parallel  execution  to 
match  the  communication  capabilities  of  the  target  ar¬ 
chitecture. 

Our  Approach 

We  use  partial  evaluation  to  eliminate  the  barriers  to 
parallel  execution  imposed  by  the  data  representations 
and  control  structure  of  a  high-level  program.  Par¬ 
tial  evaluation  is  particularly  effective  on  numerically- 
oriented  scientific  programs  since  these  programs  tend 
to  be  mostly  data-independent,  meaning  that  they  con¬ 
tain  large  regions  in  which  the  operations  to  be  per¬ 
formed  do  not  depend  on  the  numerical  values  of  the 
data being manipulated.⁵ As a result of this data-
independence,  partial  evaluation  is  able  to  perform  in 
advance,  at  compile  time,  most  data  structure  refer¬ 
ences,  procedure  calls,  and  conditional  branches  related 
to  data  structure  size,  leaving  only  the  underlying  nu¬ 
merical  computations  to  be  performed  at  run  time.  The 
underlying  numerical  computations  form  huge  sequences 
of  purely  numerical  code,  known  as  basic-blocks.  Of¬ 
ten,  these  basic-blocks  contain  several  thousand  instruc¬ 
tions.  The  order  in  which  basic-blocks  are  invoked  is 
determined  by  data-dependent  conditional  branches  and 
looping  constructs. 

We  schedule  the  partially  evaluated  program  for  par¬ 
allel  execution  primarily  by  performing  the  operations 
within  an  individual  basic-block  in  parallel.  This  is  prac¬ 
tical  only  because  the  basic  blocks  produced  by  partial 
evaluation  are  so  large.  Were  it  not  for  partial  evalua¬ 
tion,  the  basic-blocks  would  be  two  orders  of  magnitude 
smaller,  requiring  the  use  of  techniques  such  as  software 
pipelining  and  trace  scheduling,  that  seek  to  overlap  the 
execution  of  multiple  basic-blocks.  Executing  a  huge 
basic-block  in  parallel  is  very  attractive  since  it  is  clear 
at  compile  time  which  operations  need  to  be  performed, 
which  results  they  depend  on,  and  how  much  computa¬ 
tion  each  instruction  will  require,  ensuring  the  effective¬ 
ness  of  static  scheduling  techniques.  In  contrast,  par¬ 
allelizing  a  program  by  executing  multiple  basic-blocks 
simultaneously  requires  guessing  the  direction  that  con¬ 
ditional  branches  will  take,  how  many  times  a  particular 
basic-block  may  be  executed,  and  how  large  the  data 
structures  will  be. 

Our approach of combining partial evaluation with
parallelism grain size selection was used to implement
the compiler for the Supercomputer Toolkit parallel
processor.⁶ [1] The Toolkit compiler operates in four major

⁵ For instance, matrix multiply performs the same set of
operations, regardless of the particular numerical values of
the matrix elements.

⁶ See Appendix for a brief overview of the architecture of
the Supercomputer Toolkit parallel processor.


Figure 1: Four-phase compilation process that produces parallel
object code from Scheme source code.

phases, as shown in Figure 1. The first phase performs
partial evaluation, followed by traditional compiler
optimizations, such as constant folding and dead-
code  elimination.  The  second  phase  analyzes  locality 
constraints  within  each  basic-block,  locating  operations 
that  depend  so  closely  on  one  another  that  it  is  clearly 
desirable  that  they  be  computed  on  the  same  processor. 
Closely  related  operations  are  grouped  together  to  form 
a  higher  grain-size  instruction,  known  as  a  region.  The 
third  compilation  phase  uses  heuristic  scheduling  tech¬ 
niques  to  assign  each  region  to  a  processor.  The  final 
phase  schedules  the  individual  operations  for  execution 
within  each  processor,  accounting  for  pipelining,  mem¬ 
ory  access  restrictions,  register  allocation,  and  final  allo¬ 
cation  of  the  inter-processor  communication  pathways. 

The  Partial  Evaluator 

Partial  evaluation  converts  a  high-level,  abstractly  writ¬ 
ten,  general  purpose  program  into  a  low-level  program 
that  is  specialized  for  the  particular  application  at  hand. 
For  instance,  a  program  that  computes  force  interactions 
among  a  system  of  N  particles  might  be  specialized  to 
compute  the  gravitational  interactions  among  5  plan¬ 
ets  of  our  particular  solar  system.  This  specialization 
is  achieved  by  performing  in  advance,  at  compile  time, 
all  operations  that  do  not  depend  explicitly  on  the  ac¬ 
tual  numerical  values  of  the  data.  Many  data-structure 
references,  procedure  calls,  conditional  branches,  table 
lookups,  loop  iterations,  and  even  some  numerical  oper¬ 
ations  may  be  performed  in  advance,  at  compile  time, 
leaving  only  the  underlying  numerical  operations  to  be 
performed  at  run  time. 

The  Toolkit  compiler  performs  partial  evaluation  us¬ 
ing  the  symbolic  execution  technique  described  in  [4]. 
The  partial  evaluator  takes  as  input  the  program  to  be 
compiled,  as  well  as  the  input  data-structures  associ¬ 
ated  with  a  particular  application.  Some  numerical  val¬ 
ues  within  the  input  data-structures  will  not  be  avail¬ 
able  at  compile  time;  these  missing  numerical  values  are 


HIGH-LEVEL PROGRAM:

(define (square x) (* x x))

(define (sum-of-squares L)
  (apply + (map square L)))

(define input-data
  (list
    (make-placeholder 'floating-point)  ;; placeholder #1
    (make-placeholder 'floating-point)  ;; placeholder #2
    3.14))

(partial-evaluate (sum-of-squares input-data))

PARTIALLY-EVALUATED PROGRAM:

(INPUT 1)  ;; numerical value for placeholder #1
(INPUT 2)  ;; numerical value for placeholder #2

(ASSIGN 3
  (floating-point-multiply (FETCH 1) (FETCH 1)))
(ASSIGN 4
  (floating-point-multiply (FETCH 2) (FETCH 2)))
(ASSIGN 5
  (floating-point-add (FETCH 3) (FETCH 4) 9.8596))
(RESULT 5)


Figure  2:  Partial  evaluation  of  the  sum-of-squares  pro¬ 
gram,  for  an  application  where  the  input  is  known  to  be 
a  list  of  three  floating-point  numbers,  the  last  of  which 
is  always  3.14.  Notice  how  the  squaring  of  3.14  to  pro¬ 
duce  9.8596  took  place  at  compile  time,  and  how  all  list- 
manipulation  operations  have  been  eliminated. 

represented  by  a  data-structure  known  as  a  placeholder. 
The  data-independent  portions  of  the  program  are  exe¬ 
cuted  symbolically  at  compile-time,  allowing  all  opera¬ 
tions  that  do  not  depend  on  missing  numerical  values  to 
be  performed  in  advance,  leaving  only  the  lowest-level 
numerical  operations  to  be  performed  at  runtime.  This 
process  is  illustrated  in  Figure  2,  which  shows  the  result 
of  partially  evaluating  a  simple  sum-of-squares  program. 
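The symbolic execution described above can be illustrated with a small Python sketch. It is not the Toolkit compiler's implementation: the names `Ref`, `Tracer`, and `sum_of_squares` are hypothetical, and only the instruction names (`INPUT`, `ASSIGN`, `floating-point-multiply`, `floating-point-add`) are borrowed from Figure 2. Operations whose operands are all known are performed immediately; operations that touch a placeholder are recorded as residual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A value known only at run time, identified by its slot number."""
    slot: int


class Tracer:
    """Hypothetical helper that records the residual (run-time) operations."""

    def __init__(self):
        self.code = []
        self.next_slot = 1

    def _new_ref(self):
        ref = Ref(self.next_slot)
        self.next_slot += 1
        return ref

    def placeholder(self):
        # A missing numerical input: emit an INPUT and return a symbolic value.
        ref = self._new_ref()
        self.code.append(("INPUT", ref.slot))
        return ref

    def emit(self, op, *args):
        ref = self._new_ref()
        self.code.append(("ASSIGN", ref.slot, op, args))
        return ref

    def mul(self, a, b):
        if isinstance(a, Ref) or isinstance(b, Ref):
            return self.emit("floating-point-multiply", a, b)
        return a * b  # both operands known: computed at "compile time"

    def add_all(self, values):
        # Fold every known operand into one constant; keep the rest symbolic.
        known = sum(v for v in values if not isinstance(v, Ref))
        unknown = [v for v in values if isinstance(v, Ref)]
        if not unknown:
            return known
        return self.emit("floating-point-add", *unknown, known)


def sum_of_squares(tracer, values):
    # The list traversal runs here, at trace time, and leaves no residue.
    return tracer.add_all([tracer.mul(x, x) for x in values])


tracer = Tracer()
input_data = [tracer.placeholder(), tracer.placeholder(), 3.14]
result = sum_of_squares(tracer, input_data)
for instruction in tracer.code:
    print(instruction)
```

The recorded instructions mirror the residual program of Figure 2: two inputs, two multiplies, and one add whose constant operand is the precomputed square of 3.14.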

Although  partial  evaluation  is  highly  effective  on  the 
data-independent  portions  of  a  program,  data-dependent 
conditional  branches  pose  a  serious  obstacle.  Data- 
dependent  conditional  branches  interrupt  the  flow  of 
compile-time  execution,  since  it  will  not  be  known  until 
runtime  which  branch  of  the  conditional  should  be  exe¬ 
cuted.  Fortunately,  most  numerical  programs  consist  of 
large  sequences  of  data-independent  code,  separated  by 
occasional data-dependent conditional branches.⁷ We
partially  evaluate  each  data-independent  segment  of  a 

⁷ Some typical uses of data-dependent branches in scientific
programs are to check for convergence, or to examine the
accumulated error when varying the step-size of a numerical
integrator. These uses usually occur after a long sequence of
data-independent code. Indeed, the only significant exception
to this usage pattern that we have encountered is when


program, leaving intact the data-dependent branches
that glue the data-independent segments together.⁸ In
this way, each data-independent program segment is converted
into a sequence of purely numerical operations,
forming a huge basic-block that contains a large amount
of fine-grain parallelism.

Exposing  Fine-Grain  Parallelism 

Each  basic-block  produced  by  partial  evaluation  may 
be  represented  as  a  data-independent  (static)  data-flow 
graph  whose  operators  are  all  low-level  numerical  opera¬ 
tions.  Previous  work  has  shown  that  this  graph  contains 
large  amounts  of  low-level  parallelism.  For  instance,  as 
illustrated  in  Figure  3,  a  parallelism  profile  analysis  of 
the 9-body gravitational attraction problem⁹ indicates
that partial evaluation exposed so much low-level parallelism
that, in theory, parallel execution could speed up
the computation by a factor of 69 relative to
uniprocessor execution.

Achieving  the  theoretical  speedup  factor  of  69  for  the 
9-body problem would require using 516 non-pipelined
processors  capable  of  instantaneous  communication  with 
one  another.  In  practice,  much  of  the  available  paral¬ 
lelism  must  be  used  to  keep  processor  pipelines  full,  and 
it  does  take  time  (latency)  to  communicate  between  pro¬ 
cessors.  As  the  latency  of  inter-processor  communication 
increases,  the  maximum  possible  speedup  decreases,  as 
some  of  the  parallelism  must  be  used  to  keep  each  pro¬ 
cessor  busy  while  awaiting  the  arrival  of  results  from 
neighboring  processors.  Communication  bandwidth  lim¬ 
itations  further  restrict  how  parallelism  may  be  used  by 
requiring  that  most  values  used  by  a  processor  actually 
be  produced  by  that  processor. 
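A parallelism profile of the kind discussed above can be sketched in a few lines of Python. This is an illustration under idealized assumptions (unlimited processors, instantaneous communication, no pipelining), not the analysis tool used for Figure 3; the function name and the dictionary-based graph encoding are hypothetical. Each operation starts as soon as its operands are ready, and the ideal speedup is the total work divided by the critical-path length.

```python
def parallelism_profile(latency, deps):
    """Return (profile, ideal_speedup) for a static data-flow graph.

    latency: dict mapping each operation to its latency in cycles,
             listed so that producers appear before their consumers.
    deps:    dict mapping an operation to the operations it consumes.
    """
    start = {}
    for op in latency:
        # Earliest start: when the last operand has been produced.
        start[op] = max(
            (start[d] + latency[d] for d in deps.get(op, [])), default=0
        )

    critical_path = max(start[op] + latency[op] for op in latency)

    # profile[c] = number of operations executing during cycle c.
    profile = [0] * critical_path
    for op in latency:
        for cycle in range(start[op], start[op] + latency[op]):
            profile[cycle] += 1

    total_work = sum(latency.values())
    return profile, total_work / critical_path


# Sum of four squares: four multiplies feed a two-level tree of adds.
latency = {"a2": 1, "b2": 1, "c2": 1, "d2": 1, "s1": 1, "s2": 1, "total": 1}
deps = {"s1": ["a2", "b2"], "s2": ["c2", "d2"], "total": ["s1", "s2"]}
profile, speedup = parallelism_profile(latency, deps)
print(profile, speedup)
```

For this small graph the profile is 4, 2, 1 operations per cycle, so seven cycles of work finish in three. Adding communication latency or pipeline depth to the model can only lower that bound.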

a  matrix  solver  examines  the  numerical  values  of  the  ma¬ 
trix  elements  in  order  to  choose  the  best  elements  to  use  as 
pivots.  [3]  describes  additional  techniques  for  partially  eval¬ 
uating  data-dependent  branches,  such  as  generating  different 
compiled  code  for  each  possible  branch  direction,  and  then 
choosing  at  run-time  which  set  of  code  to  execute.  Although 
techniques  of  this  sort  can  not  overcome  large-scale  control 
flow  changes,  they  have  proven  quite  effective  for  dealing  with 
localized  branches  such  as  those  associated  with  the  selection 
operators  MIN,  MAX,  and  ABS,  as  well  as  with  piecewise  de¬ 
fined  equations. 

⁸ The partial evaluation phase of our compiler is currently
not very well automated, requiring that the programmer provide
the compiler with a set of input data-structures for each
data-independent code sequence, as if the data-independent
sequences are separate programs being glued together by the
data-dependent conditional branches. This manual interface
to the partial evaluator is somewhat of an implementation
quirk; there is no reason that it could not be more automated.
Indeed, several Supercomputer Toolkit users have built code
generation systems on top of our compiler that automatically
generate complete programs, including data-dependent
conditionals, invoking the partial evaluator to optimize the
data-independent portions of the program. Recent work by
Weise, Ruf, and Katz ([19], [20], [13]) describes additional
techniques for automating the partial evaluation process across
data-dependent branches.

⁹ Specifically, one time-step of a 12th-order Stormer
integration of the gravity-induced motion of a 9-body solar
system.
Figure 3: Parallelism profile of the 9-body problem. This
graph  represents  all  of  the  parallelism  available  in  the 
problem,  taking  into  account  the  varying  latency  of  nu¬ 
merical  operations. 

Grain  Size  vs.  Bandwidth 

We  have  found  that  bandwidth  limitations  make  it  im¬ 
practical  to  use  critical-path  based  scheduling  techniques 
to  spread  fine-grain  parallelism  across  multiple  proces¬ 
sors.  In  the  latency-limited  case  investigated  by  Berlin 
and  Weise  [5],  it  is  feasible  to  schedule  a  fine-grain  op¬ 
eration  for  parallel  execution  whenever  there  is  suffi¬ 
cient  time  for  the  operands  to  arrive  at  the  processor 
doing the computing, and for the result to be transmitted
to its consumers. Hence it is practical to assign
non  critical-path  operations  to  any  available  processor. 
Bandwidth  limitations  destroy  this  option  by  limiting 
the  number  of  values  that  may  be  transmitted  between 
processors,  thereby  forcing  operations  that  could  oth¬ 
erwise  have  been  computed  elsewhere  to  be  computed 
on  the  processor  that  is  the  ultimate  consumer  of  their 
results.  Indeed,  on  the  Supercomputer  Toolkit  archi¬ 
tecture,  which  suffers  from  both  latency  and  bandwidth 
limitations,  heuristic  techniques  similar  to  those  used  by 
Berlin  and  Weise  achieved  a  dismal  speedup  factor  of 
only  2.5  using  8  processors.  One  possible  solution  to  the 
bandwidth  problem  is  to  modify  the  critical-path  based 
scheduling  approach  to  make  a  much  more  careful  and 
computationally  expensive  decision  regarding  which  re¬ 
sults  may  be  transmitted  between  processors,  and  which 
processor  a  particular  result  should  be  computed  in.  Al¬ 
though  this  modification  could  be  achieved  by  adding  a 
backtracking  heuristic  that  searched  for  different  ways 
of assigning each fine-grain instruction to a processor,¹⁰

¹⁰ Indeed, one possibility would be to design the backtrack-


this  optimization  based  approach  seems  computationally 
prohibitive  for  use  on  the  huge  basic-blocks  produced  by 
partial  evaluation. 

Adjusting  the  Grain-Size 

Rather  than  extending  the  critical-path  based  approach 
to  handle  bandwidth  limitations  by  searching  for  a  glob¬ 
ally  acceptable  fine-grain  scheduling  solution,  we  seek  to 
hide  the  bandwidth  limitation  by  increasing  the  grain- 
size  of  the  operations  being  scheduled.  Prior  to  initiat¬ 
ing  critical-path  based  scheduling,  we  perform  a  local¬ 
ity  analysis  that  groups  together  operations  that  depend 
so closely on one another that it would not be practical
to  place  them  in  different  processors.  Each  group  of 
closely interdependent operations forms a larger grain-size
instruction, which we refer to as a region.¹¹ Some
regions will be large, while others may be as small as
one fine-grain instruction. In essence, grouping operations
together to form a region is a way of simplifying
the scheduling process by deciding in advance that certain
opportunities for parallel execution will be ignored
due to limited communication capabilities.

Since  all  operations  within  a  region  are  guaranteed  to 
be  scheduled  onto  the  same  processor,  the  maximum  re¬ 
gion  size  must  be  chosen  to  match  the  communication 
capabilities  of  the  target  architecture.  For  instance,  if 
regions  are  permitted  to  grow  too  large,  a  single  region 
might  encompass  the  entire  data-flow  graph,  forcing  the 
entire  computation  to  be  performed  on  a  single  proces¬ 
sor!  Although  strict  limits  are  therefore  placed  on  the 
maximum  size  of  a  region,  regions  need  not  be  of  uniform 
size.  Indeed,  some  regions  will  be  large,  corresponding 
to  localized  computation  of  intermediate  results,  while 
other  regions  will  be  quite  small,  corresponding  to  results 
that  are  used  globally  throughout  the  computation. 

We  have  experimented  with  several  different  heuristics 
for  grouping  operations  into  regions.  The  optimal  strat¬ 
egy  for  grouping  instructions  into  regions  varies  with  the 
application  and  with  the  communication  limitations  of 
the  target  architecture.  However,  we  have  found  that 
even  a  relatively  simple  grain-size  adjustment  strategy 
dramatically  improves  the  performance  of  the  scheduling 
process.  For  instance,  as  illustrated  in  Figure  4,  when  a 
value  is  used  by  only  one  instruction,  the  producer  and 
consumer  of  that  value  may  be  grouped  together  to  form 
a  region,  thereby  ensuring  that  the  scheduler  will  not 
place  the  producer  and  consumer  on  different  processors 
in an attempt to use spare cycles wherever they happened
to be available. Provided that the maximum region
size is chosen appropriately,¹² grouping operations

ing  heuristic  based  on  a  simulated  annealing  search  of  the 
scheduling  configuration  space. 

¹¹ The name region was chosen because we think of the
grain-size adjustment technique as identifying "regions" of
locality within the data-flow graph. The process of grain-size
adjustment is closely related to the problem of graph
multisection, although our region-finder is somewhat more
particular about the properties (shape, size, and connectivity) of
each "region" sub-graph than are typical graph multisection
algorithms.

¹² The region size must be chosen such that the compu-


Figure  4:  A  Simple  Region  Forming  Heuristic.  A 
region  is  formed  by  grouping  together  operations  that 
have  a  simple  producer/consumer  relationship.  This  pro¬ 
cess  is  invoked  repeatedly,  with  the  region  growing  in  size 
as  additional  producers  are  added.  The  region-growing 
process  terminates  when  no  suitable  producers  remain, 
or  when  the  maximum  region  size  is  reached.  A  pro¬ 
ducer  is  considered  suitable  to  be  included  in  a  region  if 
it  produces  its  result  solely  for  use  by  that  region.  (The 
numbers  shown  within  each  node  reflect  the  computa¬ 
tional  latency  of  the  operation.) 

together  based  on  locality  prevents  the  scheduler  from 
making  gratuitous  use  of  the  communication  channels, 
forcing  it  to  focus  on  scheduling  options  that  make  more 
effective  use  of  the  limited  communication  bandwidth. 
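The simple region-forming heuristic of Figure 4 can be sketched as follows. This is a reading of the caption, not the compiler's actual region-finder: the function name, the graph encoding, and the order in which candidates are visited are assumptions. A producer joins a region only if every consumer of its result already lies inside that region and the size limit (measured in cycles of computational latency) is not exceeded.

```python
def form_regions(latency, deps, max_size):
    """Group operations into regions; returns a dict op -> region id.

    latency:  dict op -> latency in cycles, listed producers-first.
    deps:     dict op -> list of producer operations it consumes.
    max_size: maximum total latency allowed in one region.
    """
    consumers = {op: set() for op in latency}
    for op, producers in deps.items():
        for p in producers:
            consumers[p].add(op)

    region_of = {}
    next_id = 0
    # Visit consumers before producers so regions grow "upward".
    for seed in reversed(list(latency)):
        if seed in region_of:
            continue
        members, size = {seed}, latency[seed]
        frontier = [seed]
        while frontier:
            op = frontier.pop()
            for p in deps.get(op, []):
                suitable = (
                    p not in region_of
                    and p not in members
                    and consumers[p] <= members  # used solely by this region
                    and size + latency[p] <= max_size
                )
                if suitable:
                    members.add(p)
                    size += latency[p]
                    frontier.append(p)
        for op in members:
            region_of[op] = next_id
        next_id += 1
    return region_of


# x feeds both a and c; a feeds b; b and c feed d.
latency = {"x": 1, "a": 1, "b": 1, "c": 1, "d": 1}
deps = {"a": ["x"], "b": ["a"], "c": ["x"], "d": ["b", "c"]}
print(form_regions(latency, deps, max_size=4))
print(form_regions(latency, deps, max_size=3))
```

With a limit of 4 cycles, `a`, `b`, `c`, and `d` form one region and `x` is left out only because of the size limit. With a limit of 3, `a` falls outside the region containing `c`, so `x` has consumers in two regions and stays in a region of its own, which matches the observation that widely used results are poor candidates for grouping.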

An  important  aspect  of  grain-size  adjustment  is  that 
the  grain-size  is  not  increased  uniformly.  As  shown  in 
Figure  5,  some  regions  are  much  larger  than  others.  In¬ 
deed,  it  is  important  not  to  forcibly  group  non-localized 
operations  into  regions  simply  to  increase  the  grain-size. 
For  example,  it  is  likely  that  the  result  produced  by  an 
instruction  that  has  many  consumers  will  be  transmitted 
amongst  the  processors,  since  it  would  not  be  practical 
to  place  all  of  the  consumers  on  the  result-producing  pro¬ 
cessor.  In  this  case,  creating  a  large  region  by  grouping 
together  the  producer  with  only  some  of  the  consumers 
would  increase  the  grain-size,  but  would  not  reduce  inter¬ 
processor  communication,  since  the  result  would  need  to 
be  transmitted  anyway.  In  other  words,  it  only  makes 
sense  to  limit  the  scheduler’s  options  by  grouping  opera¬ 
tions  together  when  doing  so  will  reduce  inter-processor 
communication. 

Parallel  Scheduling 

Exploiting  locality  by  grouping  operations  into  regions 
forces  closely-related  operations  to  occur  on  the  same 

tational  latency  of  the  operations  grouped  together  is  well- 
matched  to  the  communication  bandwidth  limitations  of  the 
architecture.  If  the  regions  are  made  too  large,  communi¬ 
cation  bandwidth  will  be  underutilized  since  the  operations 
within  a  region  do  not  transmit  their  results. 
Figure 5: The numerical operations in the 9-body program
were divided into regions based on locality. This
table shows how region size can vary depending on the
locality structure of the computation. Region size is
measured by computational latency (cycles).
The program was divided into 292 regions, with an
average region size of 7.56 cycles.


processor.  Although  this  reduces  inter-processor  com¬ 
munication  requirements,  it  also  eliminates  many  op¬ 
portunities  for  parallel  execution.  Figure  6  shows  the 
parallelism  remaining  in  the  9-body  problem  after  oper¬ 
ations  have  been  grouped  into  regions.  Comparison  with 
Figure  3  shows  that  increasing  the  grain-size  eliminated 
about  half  of  the  opportunities  for  parallel  execution. 
The  challenge  facing  the  parallel  scheduler  is  to  make  ef¬ 
fective  use  of  the  limited  parallelism  that  remains,  while 
taking  into  consideration  such  factors  as  communication 
latency,  memory  traffic,  pipeline  delays,  and  allocation 
of  resources  such  as  processor  buses  and  inter-processor 
communication  channels. 

The Supercomputer Toolkit compiler schedules operations
for parallel execution in two phases. The first
phase, known as the region-level scheduler, is primarily
concerned with coarse-grain assignment of regions to
processors,  generating  a  rough  outline  of  what  the  final 
program  will  look  like.  The  region-level  scheduler  assigns 
each  region  to  a  processor;  determines  the  source,  des¬ 
tinations,  and  approximate  time  of  transmission  of  each 
inter-processor  message;  and  determines  the  preferred 
order  of  execution  of  the  regions  assigned  to  each  pro¬ 
cessor.  The  region-level  scheduler  takes  into  account  the 
latency  of  numerical  operations,  the  inter-processor  com¬ 
munication  capabilities  of  the  target  architecture,  the 
structure  (critical-path)  of  the  computation,  and  which 
data  values  each  processor  will  store  in  its  memory.  How¬ 
ever,  the  region-level  scheduler  does  not  concern  itself 
with  finer-grain  details  such  as  the  pipeline  structure  of 
the  processors,  the  detailed  allocation  of  each  communi¬ 
cation  channel,  or  the  ordering  of  individual  operations 
within  a  processor.  At  the  coarse  grain-size  associated 
with  the  scheduling  of  regions,  a  straightforward  set  of 
critical-path based scheduling heuristics¹³ has proven

¹³ The heuristics used by the region-level scheduler are
closely  related  to  list-scheduling  [8].  A  detailed  discussion  of 
the  heuristics  used  by  the  region-level  scheduler  is  presented 
in  [22]. 
Figure 6: Parallelism profile of the 9-body problem after
operations  have  been  grouped  together  to  form  regions. 
Comparison  with  Figure  3  clearly  shows  that  increasing 
the  grain-size  significantly  reduced  the  opportunities  for 
parallel  execution.  In  particular,  the  maximum  speedup 
factor  dropped  from  98  times  faster  to  only  49  times 
faster  than  a  single  processor. 
quite effective. For the 9-body problem example, the
computational  load  was  spread  so  evenly  that  the  varia¬ 
tion  in  utilization  efficiency  among  the  8  processors  was 
only  1%. 
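A critical-path based list scheduler of the general kind referred to above can be sketched as follows. This is a simplified illustration, not the Toolkit's region-level scheduler: it models communication latency but not bandwidth, memory placement, or message timing, and all names are hypothetical. Each ready region is ranked by the length of the longest path from it to the end of the graph, and is placed on whichever processor lets it start earliest.

```python
def schedule_regions(size, deps, n_procs, comm_latency):
    """Greedy list scheduling; returns region -> (processor, start, finish).

    size:         dict region -> cycles, listed producers-first.
    deps:         dict region -> list of producer regions.
    comm_latency: cycles before a result is usable on another processor.
    """
    order = list(size)
    consumers = {r: [] for r in order}
    for r in order:
        for p in deps.get(r, []):
            consumers[p].append(r)

    # Priority = longest path (in cycles) from the region to the graph's end.
    priority = {}
    for r in reversed(order):
        priority[r] = size[r] + max(
            (priority[c] for c in consumers[r]), default=0
        )

    free_at = [0] * n_procs
    placed = {}
    while len(placed) < len(order):
        ready = [
            r for r in order
            if r not in placed and all(p in placed for p in deps.get(r, []))
        ]
        region = max(ready, key=priority.get)

        def start_on(proc):
            # Operands from other processors arrive comm_latency cycles late.
            arrival = max(
                (placed[p][2] + (0 if placed[p][0] == proc else comm_latency)
                 for p in deps.get(region, [])),
                default=0,
            )
            return max(free_at[proc], arrival)

        proc = min(range(n_procs), key=start_on)
        start = start_on(proc)
        placed[region] = (proc, start, start + size[region])
        free_at[proc] = start + size[region]
    return placed


# Two independent regions feed a third; latency of 6 cycles as on the Toolkit.
print(schedule_regions({"A": 4, "B": 4, "C": 4}, {"C": ["A", "B"]}, 2, 6))
```

In this example `A` and `B` run in parallel on different processors, and `C` must wait until cycle 10 for the remote operand, which shows why enough independent work is needed to hide the six-cycle latency.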

The  final  phase  of  the  compilation  process  is 
instruction-level scheduling. The region-level scheduler
provides  the  instruction-level  scheduler  with  a  set  of  op¬ 
erations  to  execute  on  each  processor,  along  with  a  set 
of  preferences  regarding  the  order  in  which  those  oper¬ 
ations  should  be  computed,  and  a  list  of  results  that 
need  to  be  transmitted  among  the  processors.  The 
instruction-level scheduler derives low-level pipelined instructions
for  each  processor,  chooses  the  exact  time  and  commu¬ 
nication  channel  for  each  inter-processor  transmission, 
and  determines  where  values  will  be  stored  within  each 
processor.  The  instruction-level  scheduler  chooses  the 
final  ordering  of  the  operations  within  each  processor, 
taking  into  account  processor  pipelining,  register  allo¬ 
cation,  memory  access  restrictions,  and  availability  of 
interprocessor-communication  channels.  Whenever  pos¬ 
sible,  the  order  of  operations  is  chosen  so  as  to  match  the 
preferences of the region-level scheduling phase. However,
the instruction-level scheduler is free to reorder operations
as needed, intertwining operations without regard
to which coarse-grain region they were originally a
member of.

The  instruction-level  scheduler  begins  by  performing  a 
data-use  analysis  to  determine  which  instructions  share 
data  values  and  should  therefore  be  placed  near  each 
other  for  register  allocation  purposes.  The  scheduler 
combines  the  data-use  information  with  the  instruction¬ 
ordering  preferences  provided  by  the  region-level  sched¬ 
uler  to  produce  a  scheduling  priority  for  each  instruction. 
The  scheduling  process  is  performed  one  cycle  at  a  time, 
performing  scheduling  of  a  cycle  on  all  processors  be¬ 
fore  moving  on  to  the  next  cycle.  Instructions  compete 
for  resources  based  on  their  scheduling  priority;  in  each 
cycle, the highest-priority operation whose data and processor
resources are available will be scheduled. Due to
this  competition  for  data  and  resources,  operations  may 
be  scheduled  out  of  order  if  their  resources  happen  to 
be  available,  in  order  to  keep  the  processor  busy.  In¬ 
deed,  when  the  performance  of  the  instruction-scheduler 
is  measured  independently  of  the  region-scheduler,  by 
generating  code  for  a  single  VLIW  processor,  utilization 
efficiencies  in  excess  of  99.7%  are  routinely  achieved,  rep¬ 
resenting  nearly  optimal  code. 

An aspect of the scheduler that has proven to be
particularly important is the retroactive scheduling of
memory references. Although computation instructions
(such as + or *) are scheduled on a cycle-by-cycle basis,
memory LOAD instructions are scheduled retroactively,
wherever they happen to fit in. For instance, when a
computation instruction requires that a value be loaded
into a register from memory, the actual memory access
operation¹⁴ is scheduled in the past for the earliest moment

¹⁴ On the Toolkit architecture, two memory operations may
occur in parallel with computation and address-generation
operations. This ensures that retroactively scheduled memory
accesses will not interfere with computations from previous
cycles that have already been scheduled.


at which both a register and a memory-bus cycle
are  available;  the  memory  operation  may  occur  50  or 
even  100  instructions  earlier  than  the  computation  in¬ 
struction.  Since  on  the  Supercomputer  Toolkit,  mem¬ 
ory  operations  must  compete  for  bus  access  with  inter¬ 
processor  messages,  retroactive  scheduling  of  memory 
references  helped  to  avoid  interference  between  memory 
and  communication  traffic. 
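One way to picture the retroactive placement of a LOAD is as a search backwards through cycles that have already been scheduled. The sketch below is an illustration under stated assumptions, not the compiler's actual data structures: `bus_busy` and `live_regs` are hypothetical per-cycle tables, a single memory bus is modelled, the load latency of one cycle is assumed, and the default of 26 registers is the number the appendix reports as available to scheduled computations.

```python
def schedule_load_retroactively(bus_busy, live_regs, use_cycle,
                                num_regs=26, load_latency=1):
    """Place a LOAD in the earliest already-scheduled cycle that can hold it.

    bus_busy[c]  -- True if the memory bus is already taken in cycle c
    live_regs[c] -- number of registers holding live values during cycle c
    Returns the chosen cycle, or None if no earlier slot exists.
    Both tables are updated in place when a slot is found.
    """
    latest = use_cycle - load_latency
    # Walk backwards from the use to find how far back a register stays free
    # for the whole interval between the load and the use.
    earliest = latest + 1
    for c in range(use_cycle - 1, -1, -1):
        if live_regs[c] >= num_regs:
            break
        earliest = c
    # Take the first free bus cycle within that window.
    for c in range(earliest, latest + 1):
        if not bus_busy[c]:
            bus_busy[c] = True
            for k in range(c, use_cycle):
                live_regs[k] += 1   # the loaded value occupies a register until it is used
            return c
    return None
```

For example, with `bus_busy = [True, False, True, False, False]` and `live_regs = [26, 3, 3, 3, 3]`, a value needed in cycle 4 is loaded in cycle 1: cycle 0 has no free register, and cycle 1 is the first cycle with a free bus slot.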

Performance  Measurements 

The Supercomputer Toolkit and its associated compiler have been used for a wide variety of applications, ranging from computation of human genetic pedigrees to the simulation of electrical circuits. The applications that have generated the most interest from the scientific community involve various integrations of the N-body gravitational attraction problem.¹⁵ Parallelization of these integrations has been previously studied by Miller[15], who parallelized the program by using futures to manually specify how parallel execution should be attained. Miller shows how one can rewrite the N-body program so as to eliminate sequential data-structure accesses to provide more effective parallel execution, manually performing some of the optimizations that partial evaluation provides automatically. Others have developed special-purpose hardware that parallelizes the 9-body problem by dedicating one processor to each planet.[2] Previous work in partial evaluation ([3], [4], [5]) has shown that the 9-body problem contains large amounts of fine-grain parallelism, making it plausible that more subtle parallelizations are possible without the need to dedicate one processor to each planet.

We have measured the effectiveness of coupling partial evaluation with grain-size adjustment to generate code for the Supercomputer Toolkit parallel computer, an architecture that suffers from serious interprocessor communication latency and bandwidth limitations. Figure 7 shows the parallel speedups achieved by our compiler for several different N-body interaction applications. Figure 9 focuses on the 9-body program (ST9) discussed earlier in this paper, illustrating how the parallel speedup varies with the number of processors used. Note that as the number of processors increases beyond 10, the speedup curves level off. A more detailed analysis has revealed that this is due to the saturation of the inter-processor communication pathways, as illustrated in Figure 10.
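For reference, the speedup measure used in these figures (single-processor execution time divided by multiprocessor execution time, as stated in the caption of Figure 7) can be written down directly. The helper names below, and the derived per-processor efficiency, are illustrative additions rather than quantities reported in the paper.

```python
def speedup(single_processor_time, multiprocessor_time):
    """Single-processor execution time divided by multiprocessor execution time."""
    return single_processor_time / multiprocessor_time


def efficiency(single_processor_time, multiprocessor_time, num_processors):
    """Speedup per processor; 1.0 would mean perfectly linear speedup."""
    return speedup(single_processor_time, multiprocessor_time) / num_processors
```

When communication saturates, adding processors stops reducing the multiprocessor time, so the speedup flattens while the efficiency falls.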

¹⁵ For instance, [23] describes results obtained using the Supercomputer Toolkit that prove that the solar system is chaotic.

Figure 7: Speedups of various applications running on 8 processors. Four different computations have been compiled in order to measure the performance of the compiler: a 6 particle Stormer integration (ST6), a 9 particle Stormer integration (ST9), a 12 particle Stormer integration (ST12), and a 9 particle fourth order Runge-Kutta integration. Speedup is the single processor execution time of the computation divided by the total execution time on the multiprocessor.

Figure 8: The result of scheduling the 9-body problem onto 8 Supercomputer Toolkit processors. Comparison with the region-level parallelism profile (Figure 6) illustrates how the scheduler spread the coarse-grain parallelism across the processors. A total of 340 cycles are required to complete the computation. On average, 6.5 of the 8 processors are utilized during each cycle.

Figure 9: Speedup graph of Stormer integrations. Ample speedups are available to keep the 8-processor Supercomputer Toolkit busy. However, the incremental improvement of using more than 10 processors is relatively small.

Figure 10: Utilization of the inter-processor communication pathways. The communication system becomes saturated at around 10 processors. This accounts for the lack of incremental improvement available from using more than 10 processors that was seen in Figure 9.

Related Work

The use of partial evaluation to expose parallelism makes our approach to parallel compilation fundamentally different from the approaches taken by other compilers. Traditionally, compilers have maintained the data-structures and control structure of the original program. For example, if the original program represented an object as a doubly-linked list of numbers, the compiled program would as well. Only through partial evaluation can the data-structures used by the programmer to think about the problem be removed, leaving the compiler free to optimize the underlying numerical computation, unhindered by sequentially-accessed data-structures and procedure calls.
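The effect described here can be shown on a toy example. The sketch below is only an illustration of the idea, not the partial evaluator used by the compiler: it runs an ordinary high-level routine on symbolic placeholders and records every numerical operation, so that the loop and the list traversal disappear and only straight-line arithmetic remains. The names `dot` and `specialize_dot` and the tuple format of the trace are assumptions made for this example.

```python
def dot(xs, ys, add, mul):
    """High-level source program: a loop over two sequences."""
    total = None
    for x, y in zip(xs, ys):
        term = mul(x, y)
        total = term if total is None else add(total, term)
    return total


def specialize_dot(n):
    """Run `dot` on symbolic inputs of known length n, recording each operation."""
    trace = []

    def emit(op):
        def record(a, b):
            result = f"t{len(trace)}"
            trace.append((result, op, a, b))
            return result
        return record

    xs = [f"x{i}" for i in range(n)]
    ys = [f"y{i}" for i in range(n)]
    result = dot(xs, ys, emit("+"), emit("*"))
    return trace, result


trace, result = specialize_dot(3)
for line in trace:
    print(line)
```

In the residual program the three multiplications have no dependencies on one another, which is exactly the kind of fine-grain parallelism that the loop structure of the source program conceals.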

Many compilers for high-performance architectures use program transformations to exploit low-level parallelism. For instance, compilers for vector machines unroll loops to help fill vector registers.[18] Other parallelization techniques include trace-scheduling, software pipelining, vectorizing, as well as static and dynamic scheduling of data-flow graphs.

Trace  Scheduling 

Compilers that exploit fine-grain parallelism often employ trace scheduling techniques[9] to guess which way a branch will go, allowing computations beyond the branch to occur in parallel with those that precede the branch. Our approach differs in that we use partial evaluation to take advantage of information about the specific application at hand, allowing us to totally eliminate many data-independent branches, producing basic-blocks on the order of several thousands of instructions, rather than the 10-30 instructions typically encountered by trace-scheduling based compilers. An interesting direction for future work would be to add trace-scheduling to our approach, to optimize across the data-dependent branches that occur at basic-block boundaries.

Most trace-scheduling based compilers use a variant of list-scheduling[8] to parallelize operations within an individual basic-block. Although list-scheduling using critical-path based heuristics is very effective when the grain-size of the instructions is well-matched to the inter-processor communication bandwidth, we have found that in the case of limited bandwidth, a grain-size adjustment phase is required to make the list-scheduling approach effective.
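A critical-path heuristic of the kind mentioned above typically ranks each operation by the length of the longest chain of dependent work that follows it. The sketch below shows one common formulation; it is not taken from the Toolkit compiler, and the dictionary-based graph representation is an assumption made for illustration.

```python
from functools import lru_cache


def critical_path_priorities(latency, successors):
    """Longest latency-weighted path from each operation to the end of the block.

    latency    -- {op: cycles the operation takes}
    successors -- {op: [operations that consume its result]}
    """
    @lru_cache(maxsize=None)
    def longest(op):
        tail = max((longest(s) for s in successors.get(op, ())), default=0)
        return latency[op] + tail

    return {op: longest(op) for op in latency}


# Hypothetical four-operation block: "a" feeds "b" and "c", which both feed "d".
latency = {"a": 1, "b": 5, "c": 1, "d": 1}
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(critical_path_priorities(latency, successors))
```

Operations on the longest chain receive the highest values and are therefore scheduled first by a list scheduler. The recursive form is adequate for small graphs; for basic blocks of several thousand instructions an iterative traversal in reverse topological order would avoid deep recursion.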

Software Pipelining

Software pipelining [11] optimizes a particular fixed size loop structure such that several iterations of the loop are started on different processors at constant intervals of time. This increases the throughput of the computation. The effectiveness of software pipelining will be determined by whether the grain-size of the parallelism expressed in the looping structure employed by the programmer matches the architecture: software pipelining cannot parallelize a computation that has its parallelism hidden behind inherently sequential data references and spread across multiple loops. The partial evaluation approach on such a loop structure would result in the loop being completely unrolled, with all of the sequential data-structure references removed and all of the fine-grain parallelism in the loop's computation exposed and available for parallelization. In some applications, especially those involving partial differential equations, fully unrolling loops may generate prohibitively large programs. In these situations, partial evaluation could be used to optimize the innermost loops of a computation, with techniques such as software pipelining used to handle the outer loops.

Vectorizing 

Vectorizing is a commonly used optimization for vector supercomputers, executing operations on each vector element in parallel. This technique is highly effective provided that the computation is composed primarily of readily identifiable vector operations (such as matrix multiply). Most vectorizing compilers generate vector code from a scalar specification by recognizing certain standard looping constructs. However, if the source program lacks the necessary vector-accessing loop structure, such programs perform very poorly. For computations that are mostly data-independent, the combination of partial evaluation with static scheduling techniques has the potential to be vastly more effective than vectorization. Whereas a vectorizing compiler will often fail simply because the computation's structure does not lend itself to a vector-oriented representation, the partial-evaluation/static scheduling approach can often succeed by making use of very fine-grained parallelism. On the other hand, for computations that are highly data-dependent, or which have a highly irregular structure that makes unrolling loops infeasible, vectorizing remains an important option.

Iterative  Restructuring 

Iterative restructuring represents the manual approach to parallelization. Programmers write and rewrite their code until the parallelizer is able to automatically recognize and utilize the available parallelism. There are many utilities for doing this, some of which are discussed in [7]. This approach is not flexible, in that whenever one aspect of the computation is changed, one must ensure that parallelism in the changed computation is fully expressed by the loop and data-reference structure of the program.

Static  Scheduling 

Static scheduling of the fine-grained parallelism embedded in large basic-blocks has also been investigated for use on the Oscar architecture at Waseda University in Japan.[12] The Oscar compiler uses a technique called task fusion that is similar in spirit to the grain-size adjustment technique used on the Supercomputer Toolkit. However, the Oscar compiler lacks a partial-evaluation phase, leaving it to the programmer to manually generate large basic blocks. Although the manual creation of huge basic-blocks (or of automated program generators) may be practical for computations such as an FFT that have a very regular structure, this is not a reasonable alternative for more complex programs that require abstraction and complex data-structure representations. For example, imagine writing out the 11,000 floating-point operations for the Stormer integration of the Solar system and then suddenly realizing that you need to change to a different integration method. The manual coder would grimace, whereas a programmer writing code for a compiler that uses partial evaluation would simply alter a high-level procedure call. It appears that the compiler for Oscar could benefit a great deal from the use of partial evaluation.

Conclusions 

Partial evaluation has an important role to play in the parallel compilation process, especially for largely data-independent programs such as those associated with numerically-oriented scientific computations. Our approach of adjusting the grain-size of the computation to match the architecture was possible only because of partial evaluation: if we had taken the more conventional approach of using the structure of the program to detect parallelism, we would then be stuck with the grain-size provided us by the programmer. By breaking down the program structure to its finest level, and then imposing our own program structure (regions) based on locality of reference, we have the freedom to choose the grain-size to match the architecture. The coupling of partial evaluation with static scheduling techniques in the Supercomputer Toolkit compiler has allowed scientists to write programs that reflect their way of thinking about a problem, eliminating the need to write programs in an obscure style that makes parallelism more apparent.
Appendix: Architecture of the Supercomputer Toolkit

The Supercomputer Toolkit is a MIMD computer. It consists of eight separate VLIW (Very Long Instruction Word) processors and a configurable interconnection network. A detailed review of the Supercomputer Toolkit architecture may be found in [1]. Each Toolkit processor has two bi-directional communication ports that may be connected to form various communication topologies. The parallelizing compiler is targeted for a configuration in which all of the processors are interconnected by two independent shared communication buses. The processors operate in lock-step, synchronized by a master clock that ensures they begin each cycle at the same moment. Each processor has its own program counter, allowing independent tasks to be performed by each processor. A single “global” condition flag that spans the eight processors provides the option of having the individual processors act together so as to emulate a ULIW (ultra-long instruction word) computer.
The Toolkit Processor

Figure 11 shows the architecture of each processor. The design is symmetric and is intended to provide the memory bandwidth needed to take full advantage of instruction-level parallelism. Each processor has a 64-bit floating-point chip set, a five-port 32x64-bit register file, two separately addressable data memories, two integer processors¹⁶ for memory address generation, two I/O ports, a sequencer, and a separate instruction memory. The processor is pipelined and is thus capable of initiating the following instructions in parallel during each clock cycle: a left memory-I/O operation, a right memory-I/O operation, a FALU operation,¹⁷ a FMUL operation,¹⁸ and a sequencer operation.¹⁹ The compiler takes full advantage of the architecture, scheduling computation instructions in parallel with memory operations or communication. The Toolkit is completely synchronous and clocked at 12.5 MHz. When both the FALU and FMUL are utilized, the Toolkit is capable of a peak rate of 200 Megaflops, 25 on each board. The compiler typically achieves approximately 1/2 of this capability because it does not attempt to simultaneously utilize both the FMUL and the FALU.²⁰

Figure 11: The overall architecture of a Supercomputer Toolkit processor node, consisting of a fast floating-point chip set, a 5-port register file, two memories, two integer ALU address generators, and a sequencer.
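The peak-rate figures quoted in this section follow from the clock rate and the number of floating-point units. The short calculation below reproduces them; the function and parameter names are illustrative, and the count of two floating-point operations per cycle is an assumption corresponding to one FALU operation plus one FMUL operation.

```python
def peak_mflops(clock_mhz=12.5, fp_ops_per_cycle=2, boards=1):
    """Peak floating-point rate in Megaflops.

    fp_ops_per_cycle=2 assumes one FALU and one FMUL operation start every cycle.
    """
    return clock_mhz * fp_ops_per_cycle * boards


print(peak_mflops())                               # one board, both units busy
print(peak_mflops(boards=8))                       # eight-board Toolkit
print(peak_mflops(fp_ops_per_cycle=1, boards=8))   # one unit per cycle, as the compiler schedules
```

With one floating-point operation issued per cycle on each board, the achievable rate is half of the peak, which matches the "approximately 1/2" reported for compiled code.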

¹⁶ Each memory address generator processor consists of an integer processor tied closely to a local register file.

¹⁷ The FALU is capable of doing integer operations, most floating-point operations, and many other one-cycle operations.

¹⁸ The FMUL is capable of doing floating-point multiplies (1 cycle latency), floating-point division (5 cycle latency), and floating-point square roots (9 cycle latency), as well as many other operations.

¹⁹ The sequencer contains a small local memory for handling stack operations.

²⁰ Simultaneous utilization of the FMUL and FALU is only occasionally worthwhile for long multiply-accumulate operations. Since the FMUL and FALU share their register-file ports, opportunities for making simultaneous use of both units are rare.

The compiler allocates two of the 32 registers for communication purposes (data buffering), while 3 registers are reserved for use by the hardware itself. Thus 26 registers are available for use by scheduled computations. The floating-point chips have a three stage pipeline whereby the result of an operation initiated on cycle N will be available in the output latch on cycle 1 + N + L, where L is the latency of the computation. The result can then be moved into the register-file during any of the following cycles, until a later result is moved into the output latch. There are feedback (pipeline bypass) paths in the floating-point pipeline that allow computed results to be fed back for use as operands in the next cycle. The compiler takes advantage of these feedback mechanisms to reduce register utilization.
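The timing rule for the floating-point pipeline can be stated as a one-line function. The sketch below applies it to the FMUL latencies given in the footnote on the FMUL; the table name `FMUL_LATENCY` and the function name are illustrative, not part of the Toolkit software.

```python
# Latencies in cycles, as listed for the FMUL unit.
FMUL_LATENCY = {"multiply": 1, "divide": 5, "sqrt": 9}


def output_latch_cycle(start_cycle, latency):
    """Cycle on which a result started at `start_cycle` reaches the output latch (1 + N + L)."""
    return 1 + start_cycle + latency


for name, latency in FMUL_LATENCY.items():
    print(name, output_latch_cycle(10, latency))
```

A scheduler can use this value as the earliest cycle at which a dependent operation, or a move into the register file, may be placed.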

The bus that connects the memory, I/O port, and register-file is a resource bottleneck, allowing only one memory load, memory store, I/O transmission, or I/O reception to be scheduled during each cycle. This bus appears twice in the architecture, once in each of the two independent memory-I/O subsystems.

Interconnection Network and Communication

The Toolkit allows for flexible interconnection among the boards through its two I/O ports. The interconnection scheme is not fixed and many configurations are possible, although changing the configuration requires manual insertion of connectors. The compiler currently views this network as two separate buses: a left and a right bus. Each processor is connected to both buses through its left and right I/O ports. This configuration was chosen as the one that would place the fewest locality restrictions on the types of programs that could be compiled efficiently. However, targeting the compiler for other configurations, such as a single shared bus on the left side, with pairwise connections between processors on the right side, may prove advantageous for certain applications. Each transmission requires two cycles to complete. Thus in the two shared-bus 8-processor configuration, only one out of every eight results may be transmitted. Pipeline latencies introduce a six cycle delay between the time that a value is produced on one processor and the time that it is available for use by the floating-point unit of another processor.
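The one-in-eight figure can be reconstructed with a short calculation. The sketch below assumes that each processor produces one result per cycle, which is consistent with the compiler issuing one floating-point operation per cycle but is not stated explicitly at this point; the function and parameter names are illustrative.

```python
from fractions import Fraction


def transmittable_fraction(num_processors=8, num_buses=2,
                           cycles_per_transmission=2,
                           results_per_processor_per_cycle=1):
    """Fraction of all computed results that the shared buses can carry."""
    # Each bus completes one transmission every `cycles_per_transmission` cycles.
    values_sent_per_cycle = Fraction(num_buses, cycles_per_transmission)
    # Assumption: every processor produces this many results in each cycle.
    values_produced_per_cycle = num_processors * results_per_processor_per_cycle
    return values_sent_per_cycle / values_produced_per_cycle


print(transmittable_fraction())   # 1/8
```

Two buses that each need two cycles per value carry one value per cycle in total, while eight processors produce eight, so the grain-size of the scheduled regions must be large enough that at most one result in eight has to leave its processor.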

The hardware permits any processor to transmit a value at any time, relying on software to allocate the communication channels to a particular processor for any given cycle. Once a value is transmitted, each receiving processor must explicitly issue a “receive” instruction one cycle after the transmission occurred. The compiler allocates the communication pathways on a cycle-by-cycle basis, automatically generating the appropriate send and receive instructions.
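As a rough illustration of what this allocation involves, the sketch below assigns each transfer to whichever bus becomes free first and emits a send for the producer and a receive, one cycle later, for every consumer. It is a simplified model rather than the compiler's algorithm: the tuple format of `transfers`, the instruction strings, and the first-come allocation policy are all assumptions.

```python
def allocate_transfers(transfers, num_buses=2, cycles_per_transfer=2):
    """Assign bus slots and generate send/receive instructions.

    transfers -- list of (value, sender, receivers, ready_cycle), in priority order
    Returns a sorted list of (cycle, processor, instruction) triples.
    """
    bus_free_at = [0] * num_buses   # first cycle at which each bus is idle
    program = []
    for value, sender, receivers, ready in transfers:
        # Pick the bus on which this transfer can start soonest.
        bus = min(range(num_buses), key=lambda b: max(bus_free_at[b], ready))
        start = max(bus_free_at[bus], ready)
        bus_free_at[bus] = start + cycles_per_transfer
        program.append((start, sender, f"send {value} on bus {bus}"))
        for r in receivers:
            # Receivers must issue their receive one cycle after the transmission.
            program.append((start + 1, r, f"receive {value} from bus {bus}"))
    return sorted(program)


transfers = [("a", 0, [1, 2], 0), ("b", 3, [0], 0), ("c", 1, [4], 1)]
for step in allocate_transfers(transfers):
    print(step)
```

In this example the third transfer has to wait until cycle 2 because both buses are occupied, which is the kind of delay that the scheduler must take into account when deciding where dependent operations can run.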